Chapter 3

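Dictionaries and Sets

Common dictionary methods
How to handle keys that cannot be found
Variations of the dict type in the standard library
The set and frozenset types
How hash tables work
Consequences of hash tables (which data types can be used as keys, unpredictable ordering, and so on)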

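3.1 Generic Mapping Types

3.2 Dict Comprehensions

A dict comprehension (dictcomp) builds a dictionary from any iterable whose elements are key-value pairs.

Example 3-1: Dict comprehensions. A dict comprehension can turn a list of tuples into two different dictionaries.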

DIAL_CODES = [
	(86, "China"),
	(91, "India"),
	(1, 'United States'),
	(62, 'Indonesia'),
	(55, 'Brazil'),
	(92, 'Pakistan'),
	(880, 'Bangladesh'),
	(234, 'Nigeria'),
	(7, 'Russia'),
	(81, 'Japan')
]

country_code = {country: code for code, country in DIAL_CODES}
print(country_code)  # 4
""" # 4 output (key order may differ; Python 3.7+ keeps insertion order)
{'China': 86, 'India': 91, 'Bangladesh': 880, 'United States': 1,
'Pakistan': 92, 'Japan': 81, 'Russia': 7, 'Brazil': 55, 'Nigeria':
234, 'Indonesia': 62}
"""
country_code_l = {code: country.upper() for country, code in country_code.items()
				 if code < 66}
print(country_code_l)  # 5
""" # 5 output: countries whose code is below 66 (if code < 66), with names uppercased (country.upper())
{1: 'UNITED STATES', 55: 'BRAZIL', 62: 'INDONESIA', 7: 'RUSSIA'}
"""

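3.3 Common Mapping Methods

Common mapping types (that is, dictionaries): dict, defaultdict, and OrderedDict. defaultdict and OrderedDict live in the collections module.

Handling missing keys with setdefault

Example 3-2: Collect the locations where each word occurs and append them to that word's list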

import sys
import re

read_file = r"F:\WorkSpace\Python\FluentPython\Chapter 3\3-3.txt"  

WORD_RE = re.compile(r"\w+")

index = {}

with open(read_file, encoding="utf-8") as fp:
	for line_no, line in enumerate(fp, 1):
		for match in WORD_RE.finditer(line):
			word = match.group()
			column_no = match.start() + 1
			location = (line_no, column_no)
			occurrences = index.get(word, [])  # 1
			occurrences.append(location) # 2
			index[word] = occurrences
for word in sorted(index, key=str.upper):
	print(word, index[word])

In the example above, the lines marked # 1 and # 2 can be optimized with the setdefault method.

Example 3-3: Optimizing Example 3-2

import re  
import sys  
  
read_file = r"F:\WorkSpace\Python\FluentPython\Chapter 3\3-3.txt"  
  
WORD_RE = re.compile(r"\w+")  
  
index = {}  
with open(read_file, encoding="utf-8") as fp:  
    for line_no, line in enumerate(fp, 1):  
        for match in WORD_RE.finditer(line):  
            word = match.group()  
            column_no = match.start() + 1  
            location = (line_no, column_no)  
            index.setdefault(word, []).append(location)  # 3
  
for word in sorted(index, key=str.upper):  
    print(word, index[word])

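Replacing # 1 and # 2 in Example 3-2 with # 3 in Example 3-3 does more than shorten the code: it reduces the number of key lookups. Example 3-2 looks each word up twice, once in get and once in the assignment back to index, and the common `if word not in index` idiom needs at least two lookups, three when the key is missing. With setdefault, a single lookup completes the whole operation.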
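
The equivalence between the explicit membership check and setdefault can be sketched as follows. The word list and the names index_a and index_b are illustrative assumptions, not part of the examples above.

```python
# Two ways to append to a list stored under a key that may be missing.
words = ["spam", "eggs", "spam"]  # hypothetical sample data
index_a = {}
index_b = {}

for position, word in enumerate(words, 1):
    # Explicit check: a membership test, then an assignment and/or item access.
    if word not in index_a:
        index_a[word] = []
    index_a[word].append(position)

    # setdefault: return the existing list, or insert a new one, in one call.
    index_b.setdefault(word, []).append(position)

print(index_a)             # {'spam': [1, 3], 'eggs': [2]}
print(index_a == index_b)  # True
```

Both loops build the same dictionary; the setdefault form simply does the work of the check, the insertion, and the retrieval in one method call.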

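3.4 Mappings with Flexible Key Lookup

Flexible key lookup means that reading a key that does not exist in the mapping should still return a default value instead of raising KeyError.

There are two ways to implement it: use defaultdict instead of a plain dict, or define your own dict subclass and implement the __missing__ method in it.

3.4.1 Flexible Key Lookup with `defaultdict`

How defaultdict works: given dd = defaultdict(list), evaluating dd['new-key'] when 'new-key' is not yet in dd proceeds as follows: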

Call list() to create a new list;
Insert that list into dd, with 'new-key' as its key;
Return a reference to that list.
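
The steps above can be observed in a short sketch. The key names are arbitrary; the last two lines illustrate that the default factory is triggered only by item access with [], not by get() or the in operator.

```python
from collections import defaultdict

dd = defaultdict(list)

# Reading a missing key with [] calls list(), stores the result, and returns it.
values = dd["new-key"]
values.append(1)

print(dd["new-key"])            # [1]
print(values is dd["new-key"])  # True: the same list object is stored in dd

# get() and `in` do not call the default factory and do not insert anything.
print(dd.get("other-key"))      # None
print("other-key" in dd)        # False
```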

Example 3-4: Further optimizing Example 3-3 by using defaultdict in place of setdefault

import sys
import re
import collections

read_file = r"F:\WorkSpace\Python\FluentPython\Chapter 3\3-3.txt"

WORD_RE = re.compile(r"\w+")

index = collections.defaultdict(list)
with open(read_file, encoding="utf-8") as fp:
	for line_no, line in enumerate(fp, 1):
		for match in WORD_RE.finditer(line):
			word = match.group()
			column_no = match.start()+1
			location = (line_no, column_no)
			index[word].append(location)
for word in sorted(index, key=str.upper):
	print(word, index[word])

3.4.2 Flexible Key Lookup with `__missing__`
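
As a minimal sketch of the `__missing__` approach, the hypothetical StrKeyDict class below retries a failed lookup with the key converted to str. The class name and its string-conversion behavior are illustrative assumptions. dict.__getitem__ calls __missing__ only when the key is not found; get() does not call it.

```python
class StrKeyDict(dict):
    """Hypothetical dict subclass: retries a failed lookup with str(key)."""

    def __missing__(self, key):
        # Called by dict.__getitem__ only when the key is not found.
        if isinstance(key, str):
            # Already a string, so there is nothing left to try.
            raise KeyError(key)
        return self[str(key)]


d = StrKeyDict({"2": "two", "4": "four"})
print(d["2"])    # two
print(d[4])      # four (found under the string key "4")
print(d.get(4))  # None: get() bypasses __missing__
```

Without the isinstance check, a missing string key would make __missing__ call itself forever through self[str(key)].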

3.5 Sets

A set is essentially a collection of unique objects, so it can be used to remove duplicates.

l = ['spam', 'spam', 'eggs', 'spam']
ll = set(l)
print(ll)  # {'eggs', 'spam'} (set order is arbitrary)
lll = list(ll)
print(lll)  # ['eggs', 'spam'] (order may vary)

Sets also implement the basic infix operators, such as | (union), & (intersection), and - (difference).
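
A quick illustration of the three operators on two small sets. The values are arbitrary, and the results are passed through sorted() only so that the printed order is predictable.

```python
a = {1, 2, 3}
b = {3, 4}

print(sorted(a | b))  # [1, 2, 3, 4]  union: elements in either set
print(sorted(a & b))  # [3]           intersection: elements in both sets
print(sorted(a - b))  # [1, 2]        difference: elements in a but not in b
```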

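Example 3-5-1: Given a set of email addresses (haystack) and a smaller set of email addresses (needles), count how many addresses in needles also appear in haystack. With set operations this takes a single line.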

found = len(needles & haystack)

Example 3-5-2: Doing Example 3-5-1 without set operations

found = 0
for n in needles:
	if n in haystack:
		found += 1

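Comparing Example 3-5-1 with Example 3-5-2: the former is faster but requires both operands to be sets; the latter is slower but works with any iterable.

Example 3-5-3: Count how many elements of needles appear in haystack.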

found = len(set(needles) & set(haystack))

# Alternative form
found = len(set(needles).intersection(haystack))

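Because the operands are converted with set(), this works with any iterable; in the second form, intersection() itself accepts any iterable as its argument.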
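
A small sketch with non-set inputs, assuming made-up example.com addresses. Because the inputs pass through a set, duplicates in needles are counted once: the result is the number of distinct shared elements.

```python
# Hypothetical sample data: a list with a duplicate, and a tuple.
needles = ["a@example.com", "b@example.com", "a@example.com"]
haystack = ("a@example.com", "c@example.com", "b@example.com")

found = len(set(needles).intersection(haystack))
print(found)  # 2
```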

Example 3-5-4: Build a set of Latin-1 characters whose Unicode names contain the word SIGN

from unicodedata import name
Latin_1 = {chr(i) for i in range(32, 256) if "SIGN" in name(chr(i), "")}
print(Latin_1)